
Chapter 13 Statistical Parsing

Given a corpus of trees, it is easy to extract a CFG and estimate its parameters. Every tree can be thought of as a CFG derivation, and we just perform relative frequency estimation (count and divide) on them. That is, let $c(A \to \beta)$ be the number of times that the rule $A \to \beta$ was observed, and then

$c(A) = \sum_\beta c(A \to \beta)$   (13.1)

$\hat{P}(A \to \beta \mid A) = \dfrac{c(A \to \beta)}{c(A)}$   (13.2)
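To make count-and-divide concrete, here is a minimal sketch in Python. The chapter gives no code, so the nested-tuple tree encoding and the function names below are assumptions for illustration only:

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) pairs for every internal node of a tree.
    A tree is (label, [children...]); a word is a plain string."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Relative-frequency estimation: P(A -> beta | A) = c(A -> beta) / c(A)."""
    rule_count, lhs_count = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in rules(tree):
            rule_count[lhs, rhs] += 1
            lhs_count[lhs] += 1
    return {(lhs, rhs): c / lhs_count[lhs] for (lhs, rhs), c in rule_count.items()}

# toy usage: a one-tree treebank
treebank = [("S", [("NP", ["I"]), ("VP", [("V", ["sleep"])])])]
for (lhs, rhs), p in estimate_pcfg(treebank).items():
    print(lhs, "->", " ".join(rhs), p)
```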

13.1 Parser evaluation

Evaluation of parsers almost always uses labeled precision and recall, or the labeled F1 score (Black et al., 1991). To define this metric, we make use of the notion of a multiset, which is a set in which items can occur more than once. If A and B are multisets, define A(x) to be the number of times that x occurs in A, and define

$|A| = \sum_x A(x)$   (13.3)

$(A \cap B)(x) = \min\{A(x), B(x)\}$   (13.4)

We view a tree as a multiset of brackets [X, i, j], one for each node of the tree, where X is the label of the node and $w_{i+1} \cdots w_j$ is its span. Note that in Penn Treebank style trees, every word is an only child and its parent is a part-of-speech tag. The part-of-speech tag nodes (also called preterminal nodes) are not included in the multiset. Let t (for test) be the parser output and g (for gold) be the gold-standard tree that we are evaluating against. Then define the precision p(t, g) and recall r(t, g) to be:

$p(t, g) = \dfrac{|t \cap g|}{|t|}$   (13.5)

$r(t, g) = \dfrac{|t \cap g|}{|g|}$   (13.6)

and the F1 score to be their harmonic mean:

$F_1(t, g) = \dfrac{2}{\frac{1}{p(t, g)} + \frac{1}{r(t, g)}}$   (13.7)

$= \dfrac{2\,|t \cap g|}{|t| + |g|}$   (13.8)

The typical setup for English parsing is to train the parser on the Penn Treebank, Wall Street Journal sections 02–21, to do development on section 00 or 22, and to test on section 23. If we train a PCFG without any modifications, we will get an F1 score of only 73%. State-of-the-art scores are above 90%.
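Equations 13.3–13.8 map directly onto Python's collections.Counter, whose & operator is exactly the min-count multiset intersection of equation 13.4. A minimal sketch, assuming the same nested-tuple tree encoding as above and skipping preterminal nodes as the text prescribes:

```python
from collections import Counter

def brackets(tree, i=0):
    """Return (multiset of (X, i, j) brackets, span length).
    Preterminal nodes (a tag over a single word) are not counted."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return Counter(), 1                       # preterminal: skip
    spans = Counter()
    j = i
    for child in children:
        child_spans, length = brackets(child, j)
        spans += child_spans
        j += length
    spans[label, i, j] += 1
    return spans, j - i

def evaluate(test_tree, gold_tree):
    """Labeled precision, recall, and F1 (eqs. 13.5-13.8)."""
    t, _ = brackets(test_tree)
    g, _ = brackets(gold_tree)
    match = sum((t & g).values())                 # |t ∩ g|
    p = match / sum(t.values())                   # eq. 13.5
    r = match / sum(g.values())                   # eq. 13.6
    f1 = 2 * match / (sum(t.values()) + sum(g.values()))  # eq. 13.8
    return p, r, f1
```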

Markovization

A PCFG captures the dependency between a parent node and all of its children. On the Penn Treebank, this leads to over 10,000 rules, each with its own probability. In practice, it turns out that this tends to be both too little and too much.

Vertical markovization

To see why it can be too little, suppose our treebank looked like this (Johnson, 1998; Klein and Manning, 2003):

[Tree diagrams lost in extraction: one bracketing (over words including "car") occurring some number of times, and another (over words including "with", "car", "dog") occurring 10 times.]

From this we would learn

$\hat{P}(\,\cdot \to \cdot\,) = 90/310$   (13.9)

$\hat{P}(\,\cdot \to \cdot\,) = 10/310$   (13.10)

and whenever the parser is asked to choose between these two trees:

(13.11) [tree diagram lost in extraction: a bracketing of words including "with", "car", "dog"]

(13.12) [tree diagram lost in extraction: an alternative bracketing of words including "with", "dog", "car"]

it will prefer the second one, which was never observed in the training data! This can be corrected by modifying the node labels to increase their sensitivity to their vertical context, in much the same way that we can increase the context-sensitivity of an n-gram language model by increasing n. We simply annotate each node with its parent's label. For example (assuming that the parent of the upper node is VP):

(13.13) [tree diagram lost in extraction: the same tree with every non-leaf node annotated with its parent's label, e.g. [mom = VP], [mom = …]; words include "car"]

Now, the parser will not be tempted to build the three-level structure (because it would require an [mom = …] node with an [mom = …] child, which is rare). We train the PCFG on these annotated trees, and then, after we parse the test data, we have to remove the annotations before evaluation. This helps the accuracy of the parser considerably (to about 77% F1).
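Parent annotation is a small recursive tree transform. The sketch below is an illustration under the assumed nested-tuple encoding from earlier; the [mom = …] label format follows the text's notation:

```python
def annotate_parents(tree, parent=None):
    """Vertical markovization: append the parent's label to every
    non-preterminal node, e.g. NP under VP becomes 'NP[mom=VP]'."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree                      # leave preterminals and words alone
    new_label = f"{label}[mom={parent}]" if parent is not None else label
    return (new_label, [annotate_parents(c, label) for c in children])

def strip_annotations(tree):
    """Undo the transform on parser output before evaluation."""
    label, children = tree
    if isinstance(children[0], str):
        return tree
    return (label.split("[")[0], [strip_annotations(c) for c in children])
```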

Binarization and horizontal markovization

On the other hand, our PCFG also captures too much dependency. Suppose the treebank contains the tree fragment

(13.14) [tree diagram lost in extraction: a phrase over words including "tallest" (JJS), "steel", "building", "America"]

but never contains

(13.15) [tree diagram lost in extraction: a similar fragment containing a JJS node]

Then the parser will fail in trying to parse:

(13.16) [tree diagram lost in extraction: a phrase over words including "tallest" (JJS), "building", "America", but not "steel"]

The problem is that if we allow long rules, then there are many possible long rules, which our model says are all independent. But we believe that there is some relationship between them. The solution is to break long rules down into smaller rules, just as we did to reduce parsing complexity. Here, it's easier to binarize the trees instead of binarizing the grammar. For example, to binarize (13.14), we introduce new nodes, and annotate each one with the children that have been generated so far:

(13.17) [tree diagram lost in extraction: the binarized version of (13.14), with each new node annotated with all of the children generated so far, e.g. [prev = …], [prev = …, JJS], [prev = …, JJS, …], …]

Note that there is enough information in the annotations to reverse the binarization. So much information, in fact, that we still can't parse (13.16). We can again apply an idea from language modeling, this time in the horizontal direction: make the generation of each child depend only on the previous (n − 1) children (Miller et al., 1996; Collins, 1999; Klein and Manning, 2003). For example, if n = 2:

(13.18) [tree diagram lost in extraction: the binarized tree with each new node annotated with only the previous child, e.g. [prev = JJS], [prev = …], …]

Now we can parse (13.16), and parser accuracy should be a little bit better.
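One way to implement this transform is a recursive binarization that, at each intermediate node, records only the labels of the previous (n − 1) children. The sketch below is an assumption-laden illustration (right-branching chain, [prev = …] label format), not the exact transform shown in the lost figure:

```python
def binarize(tree, n=2):
    """Binarize nodes with more than two children into a right-branching chain.
    Each intermediate node remembers only the previous (n-1) children
    (horizontal markovization); n=2 is the bigram case in the text."""
    label, children = tree
    if isinstance(children[0], str):
        return tree                                  # preterminal: leave alone
    children = [binarize(c, n) for c in children]
    if len(children) <= 2:
        return (label, children)

    def chain(rest, prev):
        # one child on the left, the remaining children packed into the right
        if len(rest) == 1:
            return rest[0]
        recent = prev[-(n - 1):] if n > 1 else []
        inter = f"{label}[prev={','.join(recent)}]"
        return (inter, [rest[0], chain(rest[1:], prev + [rest[0][0]])])

    return (label, [children[0], chain(children[1:], [children[0][0]])])
```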

Using linguistic knowledge

Previously we saw how to increase the amount of vertical context dependency in a PCFG by changing it, effectively, from a bigram model to a trigram model, and how to decrease the amount of horizontal context dependency by changing it, effectively, from an unbounded n-gram model to a bigram model. We can try to use linguistic knowledge to make these context dependencies more intelligent.

Lexicalization

In the vertical direction, a common technique is lexicalization (sometimes called head-lexicalization to distinguish it from another concept with the same name). In English parsing, PP attachment is one of the most difficult ambiguities to resolve, as illustrated by the well-known sentence:

(13.19) [tree diagram lost in extraction: "I saw a man with a telescope", with the PP "with a telescope" attached low, inside the object noun phrase]

(13.20) [tree diagram lost in extraction: the same sentence with the PP attached to the VP]

Although there is a strong general preference for low attachment (13.19), the words involved may change this preference. For example, after would have a definite preference for attaching to the VP.

(13.21) [tree diagram lost in extraction: "I fed the mogwai after midnight", with the PP "after midnight" attached to the VP]

Last time, we annotated each node with the label of its parent; now, we go in the opposite direction, annotating each node with the label of one of its leaves. Which one? We choose the linguistically most important one, known as its head word, using some heuristics (e.g., the head of a VP is the verb; the head of an NP is the final noun). For example, tree (13.21) would become:

(13.22) [tree diagram lost in extraction: the same tree with every node annotated with its head word, e.g. VP[head = fed], VBD[head = fed], [head = after], [head = mogwai], [head = midnight]]

What did this buy us? We are going to learn a high probability for rules like

VP[head = w] → VP[head = w] PP[head = after]   (13.23)

and a low probability for rules like

NP[head = w] → NP[head = w] PP[head = after]   (13.24)

so that we can learn that PPs headed by after prefer to attach to VPs instead of NPs.
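Head-word annotation needs only a table of head rules plus a bottom-up pass; the two heuristics mentioned above (a VP's head is its verb, an NP's head is its final noun) are enough for a sketch. The head_child rules and label format below are illustrative assumptions, not a standard head-rule table:

```python
def head_child(label, children):
    """Pick the index of the head child using crude heuristics:
    the verb for a VP, the final noun for an NP, otherwise the last child."""
    tags = [c[0] for c in children]
    if label.startswith("VP"):
        for i, t in enumerate(tags):
            if t.startswith("V"):
                return i
    if label.startswith("NP"):
        for i in reversed(range(len(tags))):
            if tags[i].startswith("N"):
                return i
    return len(children) - 1

def lexicalize(tree):
    """Annotate every node with its head word, e.g. VP -> 'VP[head=fed]'.
    Returns (annotated_tree, head_word)."""
    label, children = tree
    if isinstance(children[0], str):
        word = children[0]
        return (f"{label}[head={word}]", children), word
    new_children, heads = zip(*(lexicalize(c) for c in children))
    head = heads[head_child(label, children)]
    return (f"{label}[head={head}]", list(new_children)), head
```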

If we binarize, it is convenient to binarize so that the head is generated last (lowest). Thus:

(13.25) [tree diagram lost in extraction: a noun phrase over "little" (JJ), "house", "on", "prairie"]

(13.26) [tree diagram lost in extraction: the same noun phrase binarized so that the head is generated last, with intermediate nodes such as [left = JJ] and [left = JJ, right = …]]

Subcategorization

In the horizontal direction, a common technique is to use subcategorization. The basic idea is that some phrases (called arguments) are required and others (called adjuncts) are optional:

(13.27) Godzilla obliterated the city

(13.28) ? Godzilla obliterated

The verb obliterated normally takes a direct object, making the second sentence odd. On the other hand, in the sentences

(13.29) Godzilla exists

(13.30) * Godzilla exists the monster

the verb exists never takes a direct object. By contrast, adjuncts can occur much more freely:

(13.31) Godzilla exists today

(13.32) Godzilla obliterated the city today

This can affect parsing decisions. For example,

(13.33) I saw her duck

(13.34) I obliterated her duck

The first sentence is ambiguous for humans because saw can take either an NP or an S as an argument. The second sentence is unambiguous for humans, but ambiguous for computers unless they learn that obliterated must take an NP argument, not an S argument. Last time, we made the generation of a child node depend on one previous child. Now, we would like to use this same mechanism to control the number of arguments, depending on the verb. We can do this by making the generation of a child node depend on all of the previous arguments, and none of the previous adjuncts. (I've left off some annotations to save space.)

(13.35) [tree diagram lost in extraction: "Godzilla obliterated the city today", unannotated; labels include S, VP, VBD]

(13.36) [tree diagram lost in extraction: the same tree with head and argument annotations, e.g. S[head = obliterated], VP[head = obliterated], VP[head = obliterated, right = …], NP[head = city, arg], VBD[head = obliterated]]

We marked NP[head = city] with an arg feature to indicate that it is an argument, not an adjunct. Moreover, the right feature, and the left feature if there were one, only keeps track of previous arguments, not adjuncts.

Smoothing

With the complex nonterminals we have been creating, it may become hard to reliably estimate rule probabilities from data. The solution is to apply smoothing, as in language modeling.

Witten-Bell smoothing is a fairly common choice in parsing. For example, to estimate the probability of

VP[head = obliterated] → VP[head = obliterated, right = …] NP[head = city, arg]

we might interpolate its relative-frequency estimate with that of

VP[head = w] → VP[head = w, right = …] NP[head = city, arg]

where we have replaced the word obliterated with a placeholder w to make the rule probability easier to estimate.

If we test our parser on unseen data, it is inevitable that it will encounter unseen words. If we don't do anything about it, the parser will simply reject any string that has an unknown word, which is obviously bad. The simplest thing to do is to simulate unknown words in the training data. That is, in the training data, replace every word that occurs only once (or at most k times) with a special symbol <unk>. Then train the PCFG as usual. Then, in the test data, replace all unknown words with <unk>. It's also fine to use multiple unknown symbols. For example, we can replace words ending in -ing with <unk-ing>. A more sophisticated approach would be to apply some of the ideas that we saw in language modeling.

Beam search

The Viterbi CKY algorithm can be slow, especially if modifications to the grammar increase the nonterminal alphabet a lot. We can use beam search to speed up the search if we are willing to allow potential search errors. After completion of each chart cell best[i, j], do the following:

1: for all X ∈ N do
2:   score[X] ← best[i, j][X] · h(X)
3: end for
4: choose minscore
5: for all X ∈ N do
6:   if score[X] < minscore then
7:     delete best[i, j][X]
8:     delete back[i, j][X]
9:   end if
10: end for

The function h(X) is called a heuristic function and is meant to estimate the relative probability of getting from S at the root down to X. The typical thing to do is to let h(X) be the frequency of X in the training data. There are two common ways of choosing minscore (line 4):

• minscore = max_X score[X] · β, where 0 < β < 1 (typical values: 10^-3 to 10^-5)
• minscore is the score of the b-th best member of score (typical values of b: …)

It is also fine to set minscore to the larger of these two values.
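In code, the pruning step amounts to computing a threshold for the completed cell and deleting every entry below it. A minimal sketch, assuming each chart cell is a dict from nonterminals to Viterbi probabilities and h is a dict of nonterminal frequencies; it combines both thresholding strategies by taking the larger value, as the text allows:

```python
import heapq

def prune_cell(best_ij, back_ij, h, beta=1e-4, b=20):
    """Prune a completed chart cell best[i,j] in place.

    best_ij: dict nonterminal -> Viterbi probability
    back_ij: dict nonterminal -> backpointer
    h:       dict nonterminal -> heuristic (e.g. frequency in training data)
    """
    if not best_ij:
        return
    score = {X: p * h.get(X, 0.0) for X, p in best_ij.items()}
    # threshold 1: a fraction beta of the best score in the cell
    t1 = max(score.values()) * beta
    # threshold 2: the score of the b-th best entry (keep everything if fewer)
    t2 = min(heapq.nlargest(b, score.values())) if len(score) > b else 0.0
    minscore = max(t1, t2)
    for X in [X for X, s in score.items() if s < minscore]:
        del best_ij[X]
        back_ij.pop(X, None)
```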

Question

The time complexity of CKY is normally O(n³N³), because we have to loop over i, j, k, X, Y, and Z. If we add beam search, what will the time complexity be in terms of n and b? Assume b < N.

Bibliography

Black, E. et al. (1991). "A procedure for quantitatively comparing syntactic coverage of English grammars." In: Proc. DARPA Speech and Natural Language Workshop.

Collins, Michael (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania.

Johnson, Mark (1998). "PCFG models of linguistic tree representations." In: Computational Linguistics 24.

Klein, Dan and Christopher D. Manning (2003). "Accurate Unlexicalized Parsing." In: Proc. ACL.

Miller, Scott et al. (1996). "A Fully Statistical Approach to Natural Language Interfaces." In: Proc. ACL.
